Table of contents extraction with improved robustness

ABSTRACT

In a method for identifying a table of contents in a document ( 10 ), text fragments are extracted ( 12 ) from the document. There are identified ( 20, 30, 34, 38 ): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion ( 130 ). The identified table of contents entries and linked text fragments ( 110 ) are validated based on at least one validation criterion ( 162 ) related to distribution of the linked text fragments.

BACKGROUND

The following relates to the information storage and processing arts. It finds particular application in conjunction with cataloging of legacy documents in a marked-up format such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), or the like, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.

There is interest in the information storage and processing arts in converting document databases to a common structured format that is structured based on document content so as to facilitate searching, document categorizing, and so forth. Some suitable structured document paradigms include XML, SGML, HTML, or so forth.

The content of unstructured documents is sometimes arranged by a table of contents that sets forth a document structure employing chapters, sections, or so forth. Thus, there is interest in developing methods and apparatuses for extracting the table of contents from the document, and using the extracted table of contents as a framework for structuring the document.

Some existing methods and apparatuses for extracting tables of content from unstructured documents rely upon detecting document headings having distinctive font sizes, boldfacing, or so forth that can be detected and associated with table of contents entries. If the unstructured document is paginated, then table of contents extraction may rely upon each section indexed in the table of contents starting on a new page. However, this approach can be problematic if the paginated document includes header information at the top of each page.

The reliability of existing table of contents extraction algorithms can be relatively good, but is less than perfect. Algorithms for identifying a table of contents and associated links to chapter headings, section headings, or so forth can generate incorrect linkages, missed table of content entries, or so forth. For example, the content of a heading may be repeated in the body of the chapter or section, creating ambiguity as to which portion of content should be linked. Complex documents may include multiple copies of the table of contents, for example one copy in each volume of a multi-volume document. In such cases, there is a possibility that the extraction algorithm may incorrectly cross-link between the table of content entries. If the source document is an optically scanned document processed by optical character recognition (OCR), then the resulting electronic document may include textual errors that can lead to erroneous linkages.

Accordingly, there is a continuing need in the art for improved methods and apparatuses for enhancing the robustness of table of contents extraction techniques.

INCORPORATION BY REFERENCE

The following are commonly assigned U.S. patent applications, each of which is incorporated herein by reference.

Meunier, “Rapid Similarity Links Computation for Table of Contents Determination” (Xerox ID 20051677-US-NP, Ser. No. ______ filed ______) is incorporated herein by reference in its entirety. This application relates at least to determining similarity links in constructing a table of contents.

Dejean et al., “Structuring Document based on Table of Contents,” Ser. No. 11/116,100 filed Apr. 27, 2005 is incorporated herein by reference in its entirety. This application relates at least to organizing a document as a plurality of nodes associated with a table of contents.

Dejean et al., “Method and Apparatus for Detecting a Table of Contents and Reference Determination,” Ser. No. 11/032,814 filed Jan. 10, 2005 is incorporated herein by reference in its entirety. This application relates at least to a method for identifying a table of contents in a document. An ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.

Dejean et al., “Method and Apparatus for Detecting Pagination Constructs Including a Header and a Footer In Legacy Documents,” Ser. No. 11/032,817 filed Jan. 10, 2005 is incorporated herein by reference in its entirety. This application relates at least to detecting header and footer content in a document.

Dejean et al., “Method and Apparatus for Structuring Documents based on Layout, Content and Collection,” Ser. No. 11/033,016 filed Jan. 10, 2005 is incorporated herein by reference in its entirety. This application relates at least to conversion of a document in a relatively flat layout to a structured document in a hierarchal form.

Meunier, “Method and Apparatus for Determining Logical Document Structure,” Ser. No. 11/137,566 filed May 26, 2005 is incorporated herein by reference in its entirety. This application relates at least to processing documents having text arranged in multiple columns on a page.

Chidlovskii et al., “Systems and Methods for Converting Legacy and Proprietary Documents into Extended Markup Language Format,” Ser. No. 10/756,313 filed Jan. 14, 2004 is incorporated herein by reference in its entirety. This application relates at least to conversion of legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model.

BRIEF DESCRIPTION

According to certain aspects illustrated herein, a method is disclosed for identifying a table of contents in a document. Text fragments are extracted from the document. There are identified: (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion.

According to certain aspects illustrated herein, a method is disclosed for identifying a table of contents in a document. Text fragments are extracted from the document. There are identified: (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. The identified table of contents entries and linked text fragments are validated based on at least one validation criterion related to distribution of the linked text fragments.

According to certain aspects illustrated herein, an apparatus is disclosed for identifying a table of contents in a document. A text fragmenter extracts text fragments from the document. A table of contents region identifier identifies a contiguous sub-set of the text fragments as a table of contents region. A table of content extractor identifies (i) a substantially contiguous group of text fragments within the table of contents region as table of content entries, and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 diagrammatically shows an apparatus for identifying a table of contents.

FIG. 2 shows a similarity matrix for a document consisting of fifteen text fragments.

FIG. 3 diagrammatically shows an identified table of contents.

FIG. 4 shows the similarity matrix of FIG. 2 with a portion blocked out due to range restrictions placed on the table of contents.

FIG. 5 diagrammatically shows a situation in which a substantially contiguous group of table of content entries is erroneously associated by a table of contents extractor with linked text fragments in a copy of the table of contents.

FIG. 6 diagrammatically shows a situation in which a substantially contiguous group of table of content entries is partially correctly associated by a table of contents extractor with linked text fragments that are section headings, and is partially erroneously associated by the table of contents extractor with linked text fragments in a copy of the table of contents.

DETAILED DESCRIPTION

Various techniques can be used for extracting a table of contents from a document. Typically, the output of a table of contents extractor is a set of text fragments (possibly represented by pointers to points within the document) corresponding to table of content entries, each of which is coupled with a linked text fragment (again, possibly represented by a document pointer) indicating the corresponding chapter heading, section heading, or other heading. The links between table of content entries and linked text fragments (e.g., headings) can be identified based on various criteria, such as use of distinctive heading font size and/or font style, arrangement of text fragments on a page, or so forth.

In the following, an illustrative example table of contents extraction approach based on textual similarity of text fragments, rather than based on font characteristics, physical page layout, or so forth, is set forth. Insofar as font characteristics, page layout, and so forth may be lost or modified during document conversion processes or when the document is stored in certain formats (such as plain text), the example textual similarity-based table of content extractor has certain advantages in terms of robustness. Methods and apparatuses for improving the robustness of table of contents extraction are illustratively set forth herein with example reference to the illustrative textual similarity-based table of content extractor; however, it is to be understood that the methods and apparatuses for table of contents extraction with improved robustness disclosed herein will generally find application in conjunction with various types of table of contents extraction techniques such as those based on font characteristics, those based on page layout, or so forth.

With reference to FIG. 1, an apparatus is described for identifying a table of contents in an unstructured document 10. A text fragmenter 12 breaks the unstructured document 10 into an ordered sequence of text fragments 14. Typically, the unstructured document 10 is loaded as a list of text strings from a text or XML file produced from a document in an input format (such as Adobe PDF, Word, FrameMaker, or so forth), using an off-the-shelf converter. A paper document is suitably scanned using an optical scanner and processed by optical character recognition (OCR). For a text document, each line suitably becomes a fragment, with the fragments ordered line by line. For an XML or HTML document, each PCDATA suitably becomes a text fragment.

Several strategies can be used to order the textual fragments: depth-first left-to-right traversal (document order) or use of the fragment position in the page. Also, the relationship between XML nodes and text fragments can be preserved in order to map the detected table of contents and references back onto XML nodes at the end of the process. It is to be appreciated that the text fragmenter 12 can fragment the textual content in lines, blocks, series of words of a line, or even may split a word across two text fragments (for example, due to a different formatting on the first character of the first word of a title). The resulting ordered sequence of text fragments 14 is processed by a textual similarity links identifier 20 that identifies links 22. Each link is defined by a pair of textually similar text fragments. The text fragments of the pair defining the link are identified herein as source and target text fragments. The source text fragment is a candidate for being a table of contents entry, while the target text fragment is a candidate linked text fragment corresponding to the candidate table of contents entry.

There are various ways of defining such pairs of text fragments. In general, for N fragments, the computation of links is of order O(N²). Additionally, the possible presence of noise in the text should be accounted for. Noise can come from various sources, such as incorrect PDF-to-text conversion, or table of contents-specific problems such as a page number that appears in the table of contents but not in the document body, or a series of ellipses ( . . . ) that relate the page number to the section title in the table of contents. In some embodiments, each text fragment is tokenized into a series of alphanumeric tokens with non-alphanumeric separators such as tabs, spaces, or punctuation signs. In some embodiments, a Jaccard measure is used to measure textual similarity. The Jaccard is computed as the cardinality of the intersection of the two token sets defined by candidate source and target text fragments divided by the cardinality of the union of these two token sets. A link is defined for those pairs in which the Jaccard measure is above a selected matching threshold. In other embodiments an edit distance or other suitable measure is used as the textual similarity comparison. For an edit distance measure, the threshold is a maximum: those pairs having an edit distance less than an edit distance threshold are designated as textually similar pairs.
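By way of illustration only, the tokenization and Jaccard comparison described above can be sketched as follows. The function names, the lower-casing of tokens, the restriction to ASCII alphanumerics, and the threshold value of 0.5 are assumptions made for this sketch and are not taken from the described embodiments.

```python
import re


def tokenize(fragment):
    """Return the set of alphanumeric tokens of a text fragment.

    Lower-casing and ASCII-only tokens are assumptions of this sketch.
    """
    return set(re.findall(r"[a-z0-9]+", fragment.lower()))


def jaccard(source, target):
    """Cardinality of the token-set intersection over that of the union."""
    a, b = tokenize(source), tokenize(target)
    union = a | b
    if not union:
        return 0.0  # two fragments without tokens are treated as dissimilar
    return len(a & b) / len(union)


def is_link(source, target, threshold=0.5):
    """A link is defined when the Jaccard measure is above the threshold."""
    return jaccard(source, target) > threshold
```

In this sketch a table of contents line such as "1. Introduction . . . . 3" still links to the heading "1. Introduction", because the ellipses are separators and only the page number token is unmatched.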

With brief reference to FIG. 2, the textual similarity links are suitably visualized using a similarity matrix 100. Designating as (#i, #j) a link between a source text fragment #i and a target fragment #j, if a link (#i, #j) satisfies the threshold or other link selection criterion, then the link (#j, #i) also satisfies the threshold or other link selection criterion. Thus, the similarity matrix elements need only be computed for the upper-right half (or equivalently, lower-left half) of the similarity matrix 100. In FIG. 2, links in which the computed Jaccard exceeds a selected threshold are indicated by “X” marks in the link cells. Moreover, although not shown in FIG. 2 it will be appreciated that each link exceeding the threshold has an associated Jaccard or other metric value that indicates the strength of the link in terms of textual similarity.
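The symmetry argument can be sketched as follows: only pairs with i < j are evaluated, and each retained link keeps its similarity value. The function names, the dictionary representation of the matrix, and the threshold are assumptions made for illustration.

```python
import re
from itertools import combinations


def token_set(text):
    """Lower-cased alphanumeric tokens (an assumption of this sketch)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compute_links(fragments, threshold=0.5):
    """Return {(i, j): score} for i < j, i.e. the upper-right half only."""
    tokens = [token_set(f) for f in fragments]
    links = {}
    for i, j in combinations(range(len(fragments)), 2):
        union = tokens[i] | tokens[j]
        if not union:
            continue
        score = len(tokens[i] & tokens[j]) / len(union)
        if score > threshold:
            links[(i, j)] = score  # (j, i) would have the same score
    return links
```

The loop still visits on the order of N² pairs; the symmetry only halves the work.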

With reference to FIG. 3, a table of contents 110 represents a contiguous sub-sequence of the ordered sequence of text fragments 14. Four general criteria are used to distinguish and identify the table of contents 110 within the ordered sequence of text fragments 14. The first criterion is contiguity. The table of contents includes a contiguous sub-sequence of the ordered sequence of text fragments 14. Most of the text fragments of this contiguous sub-sequence are expected to be entries 112 of the table of contents. Each table of content entry is linked to a portion of the text that follows the table of contents by one of the links 22. These links that are associated with the table of contents 110 are indicated in FIG. 3 as curved arrows 114. It is to be appreciated that the links 114 of the table of contents 110 are a sub-set of the links 22 computed by the textual similarity links identifier 20. However, the links 22 typically include many links in addition to the sub-set of links 114. The sub-set of links 114 denotes linked text fragments that correspond with table of content entries.

Although most of the text fragments of the table of contents 110 are entries 112, a small portion of the text fragments in the contiguous sub-sequence of text fragments defining the table of contents 110 may be holes, rather than entries 112. The holes do not have associated links 114, and do not represent a table of contents entry linking to another portion of the document. An example hole 116 is shown in FIG. 3. Typically, a ratio of the number of holes to the number of entries is less than about 0.2. In some embodiments, the maximum acceptable number of holes is a user-selectable parameter. Thus, the table of content entries form a substantially contiguous group of text fragments within the ordered sequence of text fragments 14.
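As a small illustration of the hole ratio, the following sketch takes one boolean per text fragment of a candidate table of contents, true where the fragment has an associated link. The function names and the strict comparison against 0.2 are assumptions of the sketch.

```python
def hole_ratio(has_link):
    """Ratio of holes (no link) to entries (linked) in a candidate."""
    entries = sum(1 for flag in has_link if flag)
    holes = len(has_link) - entries
    if entries == 0:
        return float("inf")  # no entries at all: nothing to call a table
    return holes / entries


def substantially_contiguous(has_link, max_ratio=0.2):
    """True when the hole-to-entry ratio stays below the chosen limit."""
    return hole_ratio(has_link) < max_ratio
```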

The second criterion is textual similarity. Each link 114 should connect an entry 112 to a heading or other linked text fragment having text that is similar to the text of the entry. The textual similarity is suitably measured by the Jaccard or other text similarity measure employed by the textual similarity links identifier 20. The target or linked text fragment is typically a section heading in the case of an ordinary table of contents. In the case of a table of contents listing figures of the document (i.e., a table of figures) the target or linked text fragment may be a figure caption. In the case of a table of contents listing tables of the document (i.e., a table of tables) the target or linked text fragment may be a table caption. In some documents the figure and/or table captions may be below the corresponding figures and/or tables.

The third criterion is ordering. The target or linked text fragments of the links 114 should have an ascending ordering corresponding to the ascending ordering of the entries 112. That is, for a set of entries {#i₁, #i₂, #i₃, . . . } having a set of links {(#i₁,#j₁), (#i₂,#j₂), (#i₃,#j₃), . . . } where the set of entries {#i₁, #i₂, #i₃, . . . } have an ascending ordering, it should follow that the ordering of the corresponding set of target fragments {#j₁, #j₂, #j₃, . . . } is also ascending.
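The ordering criterion can be checked with a few lines of code. In this sketch the links are (source, target) index pairs, and the ordering is taken to be strictly ascending, which is an assumption; the function name is likewise illustrative.

```python
def targets_ascending(links):
    """True if targets ascend when the links are sorted by source index."""
    targets = [target for _, target in sorted(links)]
    return all(a < b for a, b in zip(targets, targets[1:]))
```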

The fourth criterion is lack of self-reference. All of the links 114 should initiate from within the table of contents 110, and none of the links 114 should terminate within the table of contents 110. The set of entries {#i₁, #i₂, #i₃, . . . } and the corresponding set of target text fragments {#j₁, #j₂, #j₃, . . . } should have an empty intersection, and moreover none of the target text fragments {#j₁, #j₂, #j₃, . . . } should correspond to a hole text fragment in the table of contents 110.
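A minimal sketch of the lack of self-reference check follows, assuming the table of contents occupies the inclusive index range from toc_start to toc_end, so that holes inside the range are covered as well. The names are illustrative only.

```python
def no_self_reference(toc_start, toc_end, links):
    """All links start inside the table of contents and end outside it."""
    toc = range(toc_start, toc_end + 1)  # inclusive range, holes included
    return all(source in toc and target not in toc
               for source, target in links)
```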

With reference to FIG. 4, the lack of self-reference along with the observation that the table of contents typically occurs near the beginning of the document can be used to further reduce the search area of the similarity matrix 100. If the table of contents is assumed to lie only within the first k or the last k text fragments, this leaves out the center of the matrix, as shown by the dark gray square 120 in FIG. 4 for k=5.
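One reading of this range restriction, offered as an assumption rather than as the described implementation, is that a pair of fragments needs to be compared only if at least one member lies in the first k or the last k fragments. Under that reading, the blocked-out center for fifteen fragments and k=5 is a five-by-five square.

```python
def near_edge(index, n, k):
    """True if the fragment lies in the first k or the last k positions."""
    return index < k or index >= n - k


def pairs_to_compare(n, k):
    """Upper-half pairs (i < j) that survive the range restriction."""
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if near_edge(i, n, k) or near_edge(j, n, k)]
```

For n=15 and k=5 this keeps 95 of the 105 upper-half pairs; the 10 dropped pairs are those with both fragments in the middle five positions.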

With returning reference to FIG. 1, one approach for selecting from amongst the ordered sequence of text fragments 14 the table of contents 110 including a contiguous sub-sequence of the ordered sequence of text fragments is described. The table of contents is selected based on the contiguity, text similarity, ordering, and non-self-referencing criteria. A candidate tables of contents selector 30 selects one or more candidate tables of contents. In one suitable approach, N hypotheses are tested, corresponding to N candidate starting text fragments for the table of contents. For each of the N possible starting fragments, the hypothesis “Could the table of contents start at this text fragment?” is tested. In some suitable embodiments, the testing starts at the candidate starting text fragment and then looks at each subsequent text fragment in turn to consider it for inclusion in the candidate table of contents. The candidate table of contents is extended by adding subsequent contiguous text fragments until the addition of a new text fragment breaks the ordering constraint. For example, if the last added text fragment is a source text fragment having links to target fragments #j=15 and #j=33, and the next text fragment under consideration is a source text fragment having a link only to target fragment #j=20, then this next text fragment can be added to the candidate table of contents since #j=20 is greater than #j=15. If, however, the next text fragment is a source text fragment only having a link to target fragment #j=12, then this would break the ordering. However, it is advantageous to relax the ordering constraint somewhat to allow for a few holes in the candidate table of contents. This is suitably achieved by permitting the presence of a certain number of text fragments without any associated links, and by permitting a certain number of fragments with link-crossing, that is, a text fragment for which all of its associated links break the ordering constraints in the candidate table of contents. Allowing some link-crossing is useful if for example the previous text fragment in the current candidate table of contents contained only one link pointing too far ahead in the document.
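The first pass can be sketched as a greedy extension, given here as one possible reading rather than the described implementation. The sketch keeps the smallest target that is still consistent with ascending order, and it draws fragments without links and fragments with link-crossing from a single hole budget. The function name, the links_by_source dictionary, and max_holes are hypothetical.

```python
def extend_candidate(start, links_by_source, n, max_holes=2):
    """Greedily extend a candidate table of contents from fragment start.

    links_by_source maps a source index to its list of target indices.
    Returns the indices accepted as entries; fragments skipped in between
    are the tolerated holes.
    """
    entries = []
    last_target = -1  # smallest target still allowing ascending order
    holes = 0
    for i in range(start, n):
        feasible = [t for t in links_by_source.get(i, []) if t > last_target]
        if feasible:
            last_target = min(feasible)
            entries.append(i)
        else:
            holes += 1  # no links, or every link crosses an earlier one
            if holes > max_holes:
                break
    return entries
```

With the example from the text, a fragment linking to 15 and 33 followed by a fragment linking only to 20 is accepted, whereas a following fragment linking only to 12 is treated as a hole.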

To enforce the non-self-referencing constraint, a second pass is suitably performed once the extent of the candidate table of contents is tentatively determined with respect to the ordering constraint. Using a second pass accounts for indeterminacy as to the end of the table of contents, as the table of contents end is unknown while it is being extended from its start point. The second pass starts at the original starting text fragment at the top of the candidate table of contents. Each subsequent text fragment is tested. If a subsequent text fragment includes links only to text fragments within the table of contents, then it violates the non-self-referencing criterion; accordingly, the second pass would terminate the candidate table of contents just before that non-self-reference violating text fragment. Again, however, it may be advantageous to allow a certain number of holes. This is suitably achieved in the second pass by allowing one or a few text fragments of the candidate table of contents to be self-referencing. These text fragments that violate the self-referencing criterion are assumed to be holes, rather than entries, in the candidate table of contents.
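A simplified sketch of the second pass is given below. It assumes the tentative extent from the first pass is the inclusive range from start to end, and it tests self-reference against that fixed extent rather than re-evaluating as the end moves. The function name and max_holes parameter are hypothetical.

```python
def trim_self_references(start, end, links_by_source, max_holes=1):
    """Return the new last index of the candidate table of contents.

    A fragment whose links all land inside [start, end] is
    self-referencing; up to max_holes such fragments are tolerated as
    holes, and the candidate is cut just before the next one.
    """
    holes = 0
    for i in range(start, end + 1):
        targets = links_by_source.get(i, [])
        if targets and all(start <= t <= end for t in targets):
            holes += 1
            if holes > max_holes:
                return i - 1  # terminate just before the violating fragment
    return end
```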

This processing is repeated for each of the N possible starting text fragments. The result is a set of one or more candidate tables of contents, each formed of a contiguous list of text fragments. Because the candidate tables of contents were constructed in a way that ensures that the ordering and non-self-reference constraints can be obeyed (while optionally allowing for a limited number of holes), it follows that it is possible to select for each entry of the candidate table of contents one link from its list of acceptable links so that the ordering and non-self-reference constraints are respected. Optionally, filtering criteria may be applied by the candidate tables of contents selector 30 to remove candidate tables of contents that are clearly not correct. For example, a candidate table of contents having fewer than three text fragments may suitably be discarded.

A table of contents selector 34 ranks the candidate tables of contents and selects the highest ranked table of contents. A links optimizer 38 then optimizes the links for that selected table of contents to produce the final selected table of contents 110. In this approach, the table of contents selector 34 ranks the candidate tables of contents using a computationally efficient ranking method, and only the selected table of contents then undergoes the more computationally intensive processing performed by the links optimizer 38. This approach has been found to be satisfactory for accurately selecting the table of contents. A more computationally intensive approach is also contemplated, in which the links optimization is performed for each of the candidate tables of contents and the ranking takes into account the results of the links optimizations.

With continuing reference to FIG. 1, the table of contents selector 34 employs a scoring function to rank the candidate tables of contents. The highest ranked candidate table of contents is then selected for further processing. In some embodiments, the scoring function is the sum of entry weights, where an entry weight is inversely proportional to the number of links associated with an entry of the table of contents. This entry weight characterizes the certainty of any of its associated links, under the assumption that the more links initiate at a given source text fragment, the less likely that any one of those links is a “true” link of a table of contents. Other scoring functions can be employed. For example, another contemplated scoring function sums the number of entries in the candidate table of contents. A short candidate table of contents is typically less likely to be “correct” than a long table of contents in which a large contiguous sequence of text fragments is found to be capable of satisfying the ordering and non-self-referencing criteria. Yet another contemplated scoring function characterizes the span of the target text fragments respective to the entire document 10. If the available target text fragments of a particular candidate table of contents span only a small portion of the document 10, then it is unlikely that the candidate table of contents is the “true” table of contents for that document. These scoring functions are examples, and other scoring functions, or various combinations of scoring functions, can be employed.
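The first scoring function can be sketched as follows, taking the constant of proportionality to be 1 so that an entry with m links contributes 1/m. The function names and the candidate representation (a list of entry indices plus a links_by_source dictionary) are assumptions of the sketch.

```python
def entry_weight_score(entries, links_by_source):
    """Sum of 1 / (number of links) over the entries of a candidate."""
    return sum(1.0 / len(links_by_source[e])
               for e in entries
               if links_by_source.get(e))


def best_candidate(candidates, links_by_source):
    """Pick the candidate (a list of entry indices) with the top score."""
    return max(candidates,
               key=lambda entries: entry_weight_score(entries, links_by_source))
```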

Once the highest ranked candidate table of contents has been selected, the links optimizer 38 is applied to that table of contents to produce the final table of contents 110. The selection of the best link for each of the entries of the table of contents involves finding a global optimum for the table of contents while respecting the four table of contents constraints: contiguity, text similarity, ordering, and non-self-referencing. In some embodiments, a weight is associated with each link, which is proportional to its level of matching. In some embodiments, a Viterbi shortest path algorithm is employed in selecting the optimized links. Other algorithms can also be employed for selecting the optimized links.
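The following Viterbi-style dynamic program is a sketch of how one link per entry might be chosen so that the targets ascend and the total link weight is as large as possible. It is not asserted to be the optimizer of the described embodiments: it maximizes summed weights rather than minimizing a path length, it requires every entry to receive a link (no holes), and its names are hypothetical.

```python
NEG_INF = float("-inf")


def optimize_links(candidates):
    """candidates[k] lists (target, weight) options for the k-th entry.

    Returns one target per entry with strictly ascending targets and
    maximal total weight, or None if no ascending assignment exists.
    """
    if not candidates:
        return []
    # best[k][c] = (best total weight ending at option c, back-pointer)
    best = [[(weight, None) for _, weight in candidates[0]]]
    for k in range(1, len(candidates)):
        row = []
        for target, weight in candidates[k]:
            options = [(best[k - 1][p][0] + weight, p)
                       for p, (prev_target, _) in enumerate(candidates[k - 1])
                       if prev_target < target and best[k - 1][p][0] != NEG_INF]
            row.append(max(options) if options else (NEG_INF, None))
        best.append(row)
    last = max(range(len(best[-1])),
               key=lambda c: best[-1][c][0], default=None)
    if last is None or best[-1][last][0] == NEG_INF:
        return None
    path, c = [], last
    for k in range(len(candidates) - 1, -1, -1):  # follow back-pointers
        path.append(candidates[k][c][0])
        c = best[k][c][1]
    return path[::-1]
```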

In some applications, the identified table of contents is used for further processing. For example, the table of contents 110 can be employed to structure the unstructured document 10 by dividing it up into text sections linked to table of contents entries to produce a structured document. When the table of contents is used for automated structuring of the document, it is typically advantageous to strictly enforce the ordering criterion since cross-linking can lead to errors in the formatting of the structured document. While the foregoing has particularly addressed the example application of identifying a table of contents indexing text sections, it is to be appreciated that the disclosed approaches are also applicable to identifying tables of contents indexing other features of a document, such as figures (commonly called a “table of figures”), tables (commonly called a “table of tables”), equations, and so forth.

Having set forth the example illustrative textual similarity-based table of contents extractor, some techniques for improving the robustness of table of contents extraction are set forth with illustrative example application thereto. It is to be appreciated that the techniques disclosed herein are generally applicable to other types of table of contents extraction techniques.

In general, the techniques disclosed herein are based on reducing the number of text fragments which are candidates for table of contents entries and/or for identification as linked fragments based on one or more reduction criteria. Additionally or alternatively, robustness can be improved by validating the identified table of contents entries and linked fragments based on one or more validation criteria related to distribution of the linked text fragments.

With reference to FIG. 1, in some embodiments, one or more reduction criteria 130 are applied by the textual similarity links identifier 20 to reduce the number of text fragments that are candidates for identification as linked text fragments. For example, the reduction criteria 130 may include one or more regular expressions with which text fragments are compared. Text fragments which match the regular expression (or, alternatively, which do not match the regular expression) are excluded as candidates for identification as linked text fragments. For example, the regular expression may set forth an indexing text fragment portion such as a leading numeric index, a leading alphabetic index, a leading roman numeral index, or so forth, and text fragments that do not match or satisfy the indexing fragment portion defined by the regular expression are excluded from consideration as candidate linked text fragments. This approach is useful where the table of contents entries are indexed by, for example, a chapter number (e.g., Chapter 1, . . . Chapter 2, . . . , etc.), alphabetic section index (e.g., “A. Introduction”, . . . “B. Description of the problem”, . . . etc.), or so forth.

As another example, the regular expression may set forth that the text fragment contain at least one keyword typically indicative of a chapter heading, section heading, or so forth. For example, the keyword may be “part”, “section”, “chapter”, “book”, or so forth, or various combinations thereof. Text fragments which do not satisfy the regular expression because they contain none of the keywords indicative of being a heading are excluded. In some such regular expressions, the location of the keyword may be incorporated into the regular expression. For example, the regular expression may be something such as: “Chapter *” which indicates that the text fragment must begin with the capitalized word “Chapter” followed by a space and any other text (as indicated by the trailing asterisk). In other such regular expressions, the expression may be satisfied if the keyword appears anywhere in the text fragment.

Other regular expressions can be used, alone or in combination. As yet another example, the regular expression may require that the text fragment be in all-caps, so that text fragments containing lower-case letters (or more than one or two lower-case letters, or some other similar pattern) are excluded from further consideration by the textual similarity links identifier 20. While the term “regular expression” is used herein, it is to be appreciated that the comparison with the regular expression may be computationally implemented in various ways, such as using a text search algorithm (for finding a keyword in a text fragment), a finite state network-based automaton (for performing comparisons with simple or complex character string patterns), or so forth.
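The regular expression-based reduction criteria can be illustrated with Python's standard re module. The specific patterns below are assumptions chosen for the sketch; in particular, the wildcard-style “Chapter *” is rendered in Python regular expression syntax as ^Chapter .*, and the index pattern requires a trailing period or parenthesis after alphabetic and roman numeral indices to avoid matching ordinary sentences.

```python
import re

# Leading numeric ("2.1"), alphabetic ("A.") or roman numeral ("IV.") index.
INDEX_PATTERN = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*[.)]?|[A-Z][.)]|[IVXLCDM]+[.)])\s+\S")
# Heading keyword anywhere in the fragment, in any letter case.
KEYWORD_PATTERN = re.compile(r"\b(?:part|section|chapter|book)\b",
                             re.IGNORECASE)
# Fragment must begin with the capitalized word "Chapter" and a space.
CHAPTER_PREFIX = re.compile(r"^Chapter .*")


def is_all_caps(text):
    """True if the fragment has letters and none of them is lower-case."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


def reduce_candidates(fragments, pattern):
    """Keep only the fragments that satisfy the regular expression."""
    return [f for f in fragments if pattern.search(f)]
```

Fragments that fail the chosen criterion are dropped before any similarity is computed, which reduces the number of candidate linked text fragments.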

In some embodiments, the one or more reduction criteria 130 include one or more criteria that are based on page position of the text fragments. For such embodiments to be operative, the source document 10 should include page layout information (such as a PDF document, which is page layout-based) and the text fragmenter 12 should detect and associate the page position information with the text fragments 14. Page information may include, for example, vertical page position (denoted by a distance from top of page, a line number, or so forth). If the source document 10 has a multi-column format, then the page information may include a column index for each text fragment indicating which column the text fragment was extracted from.

Some suitable page position based reduction criteria are as follows: (i) limit candidate text fragments for consideration as links to those text fragments within a selected distance from a top of the page (this criterion assumes that sections or chapters typically start at the top of a page); and/or (ii) limit candidate text fragments for consideration as links to those text fragments whose associated column index corresponds to a leftmost column (this criterion assumes that sections or chapters typically start at the leftmost column of a page). Other page position based reduction criteria may be applicable depending upon a priori knowledge of the document page layout.
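The two page position criteria can be sketched as a filter over fragments that carry layout information. The Fragment record, its field names, the units of the top distance, and the convention that column 0 is the leftmost column are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    top_distance: float  # distance from top of page (units are hypothetical)
    column: int          # column index, with 0 assumed to be leftmost


def reduce_by_position(fragments, max_top=None, leftmost_only=False):
    """Keep fragments near the top of the page and/or in the first column."""
    kept = []
    for fragment in fragments:
        if max_top is not None and fragment.top_distance > max_top:
            continue  # criterion (i): too far below the top of the page
        if leftmost_only and fragment.column != 0:
            continue  # criterion (ii): not in the leftmost column
        kept.append(fragment)
    return kept
```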

Regular expression-based reduction criteria are suitably applied to reduce the number of candidate linked text fragments, or to reduce the number of candidate table of content entries, or to reduce both. Certain criteria may be better suited for reducing the number of candidate linked text fragments but not well suited for reducing the number of candidate table of content entries. For example, page layout-based reduction criteria may be applicable to reducing the number of candidate linked text fragments, since chapter or section headings are likely to be located at the top of a page or so forth, whereas such page layout-based criteria may be inapplicable to the table of content entries, which may form a list of entries that spans one or more pages.

In some embodiments, a contiguous sub-set of the text fragments is identified as a table of contents region 134. For example, a user interface 136 can be configured to receive a user identification of the table of contents region. As one example, the user may scan in the source document 10, and at the time of scanning indicate which scanned page or pages contain the table of contents. Then, when the text fragmenter 12 fragments the source document 10 to produce the text fragments 14, those text fragments extracted from the page or pages indicated by the user as containing the table of contents are assigned as the table of contents region 134. In another approach, the table of contents region 134 may be defined as a first portion of the document (for example, the first 5% of the total number of text fragments 14, or the first 10%, or so forth), in recognition of the expectation that the table of contents is likely to be near the beginning of the document. This approach reduces the number of candidate table of content entries as well as the number of candidate linked text fragments. A combination of a restrictive table of contents region 134 that restricts the number of candidate table of content entries and application of reduction criteria 130 to reduce the number of candidate linked text fragments can provide substantial improvement in the speed of table of content extraction.
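The “first portion of the document” approach can be sketched as follows. The function name and the rounding-up behavior are illustrative assumptions; the text only specifies a percentage of the total number of text fragments.

```python
def default_toc_region(fragments, percent=5):
    """Return the leading `percent` of the fragments as the table of contents region.

    The count is rounded up with integer arithmetic, so any non-empty
    document yields at least one fragment (an assumption of this sketch).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    count = (len(fragments) * percent + 99) // 100
    return fragments[:count]
```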

However, these techniques (a restrictive table of contents region 134 and application of one or more reduction criteria 130) do not address another possible problem which can arise, namely the presence of multiple copies of the table of contents. Such a situation can arise, for example, if the document is a multi-volume document and each volume contains a copy of the table of contents. In some document arrangements, only partial copies of the table of contents may be present. For example, a multi-volume document may include a main table of contents listing all chapters or sections, and each volume may include a volume table of contents listing only those chapters or sections contained in that volume.

FIG. 5 illustrates a case in which a document contains a main table of contents 140, and a copy 142 of the table of contents, for example at the beginning of a second volume of the document. Because the copy 142 contains substantially similar, or possibly even exact, duplicates of the entries of the table of contents 140, it is possible, or perhaps even likely, that the table of contents identification process may erroneously identify as linked text fragments 144 the text fragments of the copy 142. FIG. 5 illustrates a “worst-case” scenario in which every entry of the table of contents 140 is erroneously linked with a corresponding entry of the copy 142.

FIG. 6 illustrates a less dramatic situation in which a document contains a table of contents 150, section headings 152, 154 (corresponding to sections A and B) in a first volume, a copy 156 of the table of contents at the start of a second volume, and another section heading 158 (corresponding to section C) in the second volume. Here, the table of content entries for sections A and B are correctly associated with linked text fragments corresponding to the section headings 152, 154. However, the table of content entries for sections C, D, and E are erroneously associated with linked text fragments within the copy 156 of the table of contents, and the section heading 158 (corresponding to section C) is erroneously not identified as a linked text fragment. A similar situation can arise where each volume includes a partial copy of the table of contents listing only those chapters or sections contained in that volume.

With returning reference to FIG. 1, to address situations such as the example situations illustrated in FIGS. 5 and 6, the selected table of contents 110 is suitably validated by a table of contents validator 160 based on one or more validation criteria 162 that are related to distribution of the linked text fragments. One suitable validation criterion is the fraction of the document spanned by the linked text fragments. This criterion is effective for identifying situations such as that of FIG. 5, since the copy of the table of contents is unlikely to span a very large portion of the document. In some suitable embodiments, the selected table of contents 110 is validated (at least respective to this validation criterion) if the span of the linked text fragments is greater than a validation fraction threshold of about 10% to 20%. Note that this criterion may fail to identify erroneous situations such as that of FIG. 6, in which only some of the linked text fragments are erroneously in the copy of the table of contents, because in such cases the span of the linked text fragments including both correct and incorrect linkages may be larger than the validation fraction threshold.
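The span criterion can be sketched in Python as follows. Measuring the span in fragment indices, and the default threshold of 0.15 (a value inside the 10% to 20% range mentioned above), are assumptions of this sketch.

```python
def span_fraction(linked_indices, total_fragments):
    """Fraction of the document covered between the first and last linked fragment."""
    if not linked_indices or total_fragments <= 0:
        return 0.0
    span = max(linked_indices) - min(linked_indices) + 1
    return span / total_fragments


def validate_span(linked_indices, total_fragments, threshold=0.15):
    """Validate only if the linked fragments span more than `threshold` of the document."""
    return span_fraction(linked_indices, total_fragments) > threshold
```

In a FIG. 5-like case, ten links clustered inside a copy of the table of contents in a 2000-fragment document give a span of 0.005 and fail validation. In a FIG. 6-like case, the correct links to sections A and B widen the span, so the check can still pass.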

Another approach for addressing a situation like that of FIG. 5 in which there are one-to-one mappings between the table of contents 140 and the copy 142 is to invalidate any contiguous group of linked text fragments that have a one-to-one correspondence with more than a threshold number of contiguous table of content entries. This can be done by the table of contents validator 160. When such a contiguous group of one-to-one linked text fragments is found, an invalid links remover 164 removes the contiguous group of linked text fragments from the group of text fragments identified as linked text fragments, and the table of contents selector 34 and links optimizer 38 update the identification of the selected table of contents 110 including the linked text fragments with that contiguous group of linked text fragments removed or excluded as candidates for identification as linked text fragments. This approach works well for exact copies of the table of contents such as is shown in FIG. 5, and for partial copies of the table of contents in which the partial copy includes a contiguous group of more than the threshold number of contiguous one-to-one linked text fragments. However, it will not recognize “substantial” copies of the table of contents which include gaps or other breaks in the one-to-one contiguity.
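A minimal Python sketch of the contiguous one-to-one check follows. It assumes each selected link is given as a pair (table of content entry index, linked fragment index), sorted by entry index, and it treats “contiguous” as meaning that both indices advance by exactly one. Both are illustrative assumptions.

```python
def find_one_to_one_runs(links, threshold):
    """Return runs of more than `threshold` pairs whose indices both advance by one.

    links: list of (toc_index, doc_index) pairs sorted by toc_index.
    """
    runs, current = [], []
    for pair in links:
        contiguous = (
            current
            and pair[0] == current[-1][0] + 1
            and pair[1] == current[-1][1] + 1
        )
        if contiguous:
            current.append(pair)
        else:
            if len(current) > threshold:
                runs.append(current)  # a run long enough to invalidate
            current = [pair]
    if len(current) > threshold:
        runs.append(current)
    return runs
```

Each returned run is a group of links that a component such as the invalid links remover 164 would discard. A single gap in the copy splits a run, which is the limitation noted above.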

To address situations such as that of FIG. 6, or situations in which partial copies of the table of contents are erroneously linked, the one or more validation criteria 162 suitably include a criterion that validates conditional upon the linked text fragments not being within a substantial copy of the substantially contiguous group of text fragments identified as table of content entries. To apply this validation criterion, a suitable technique should be used to search for a substantial copy of the substantially contiguous group of text fragments identified as table of content entries. The search technique should tolerate some differences between the table of contents and the copy to account for noise due to document imperfections, erroneous links, holes in the table of contents, and so forth. Additionally, the search technique should be able to identify a partial copy as a substantial copy of the substantially contiguous group of text fragments identified as table of content entries.

In one suitable approach, the search technique compares the table of content entries with a candidate copy using a figure of merit including at least (i) a measure of textual similarity between the table of content entries and text fragments of the candidate copy and (ii) a longest common string contained in both the table of content entries and the candidate copy. If the table of contents extractor uses a textual similarity-based process (such as the example illustrative textual similarity-based table of contents extractor described herein), then the links 22 readily provide the measure of textual similarity between the table of content entries and text fragments of the candidate copy. For table of contents extractors employing font-based or other processes, the Jaccard similarity measure employed herein, or another text similarity measure, can be used.
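For reference, a Jaccard similarity between two text fragments can be computed as the size of the intersection of their token sets divided by the size of the union. The sketch below assumes lower-cased, whitespace-delimited tokens; that tokenization is an assumption of this sketch.

```python
def jaccard(fragment_a, fragment_b):
    """Jaccard similarity of the lower-cased whitespace token sets of two fragments."""
    tokens_a = set(fragment_a.lower().split())
    tokens_b = set(fragment_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty fragments
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A table of content entry and its duplicate in a copy score 1.0 under this measure, which is why the copy is easily mistaken for the set of linked headings.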

The longest common string component sequentially compares pairs of fragments. When two fragments are not equal, rather than exploring the two possibilities as is typically done in longest common string-type algorithms (that is, ignoring one or the other), the one that comes first in the page reading order is ignored. This lets copies be found with linear complexity and no storage, while tolerating a given maximum number of differences per page. In a suitable approach, given a table of contents candidate, the N pairs of text fragments formed by the first table of content entry and each of its N possible corresponding fragments in the document body are considered. Those N pairs indicate the potential starts of a copy of the table of contents. This approach assumes that a table of contents candidate is formed by making a hypothesis regarding its start, so the sequence of matching pairs starts with the first entry. (In other embodiments, it is contemplated to consider all pairs with the first pair member being in the M first entries of the table of contents, where M is the maximum number of differences tolerated. This approach may be more robust.) For each of the N pairs, the following iterative processing is performed:

    for fragDoc in TOC_Candidate.firstFrag.links:
        fragToC = TOC_Candidate.firstFrag
        nbError = 0
        while (nbError < nbErrorMax) and not (out-of-toc or out-of-fragment-list):
            # Reset the error count for each new page
            if (fragToC and fragDoc) are-on-top-of-page:
                nbError = 0
            if fragDoc in fragToC.links:
                # ok, they match pairwise
                fragDoc = fragDoc.next
                fragToC = fragToC.next
            else:
                # we have a difference
                nbError += 1
                if fragDoc is-after fragToC:
                    fragToC = fragToC.next
                else:
                    fragDoc = fragDoc.next
        # exit of the while loop
        if (nbError < nbErrorMax) and out-of-toc:
            # We found a copy of the ToC
            # We will here remove all the links to the copy from the ToC fragments
            . . .

The above iterative processing compares pairs of fragments in a sequential manner, tolerating a given maximum number of differences per page. The is-after predicate exploits first the page ordering, then the column ordering, and finally the vertical position to compare the position of the two fragments in the document order. A partial copy of the table of contents that shares the same starting point can be found in a similar way. To find partial copies not sharing the same starting point as the complete table of contents, the same method can be applied starting with all table of content entries starting a new page.
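The iterative processing above can be made concrete with the following Python sketch. The `Frag` record, the use of list indices in place of `.next`, and the toy `example_document` are assumptions made for illustration. In particular, this sketch compares pages relative to the page on which each sequence starts, which is one possible reading of the is-after predicate given that both copies are assumed to share the same layout.

```python
from dataclasses import dataclass, field


@dataclass
class Frag:
    page: int          # page number in the document
    y: float           # vertical position measured from top of page
    column: int = 0    # column index, 0 assumed leftmost
    links: set = field(default_factory=set)  # indices of textually similar fragments


def is_top_of_page(frags, k):
    """A fragment is on top of its page if it is the first fragment of that page."""
    return k == 0 or frags[k - 1].page != frags[k].page


def is_copy_at(frags, toc_start, toc_end, doc_start, max_errors):
    """Walk frags[toc_start:toc_end] and the fragments from doc_start in parallel.

    Returns True if the ToC entries are exhausted before the per-page
    error budget is used up.
    """
    toc_page0 = frags[toc_start].page
    doc_page0 = frags[doc_start].page

    def position(k, page0):
        # Assumption: pages are compared relative to where each sequence starts.
        f = frags[k]
        return (f.page - page0, f.column, f.y)

    i, j, errors = toc_start, doc_start, 0
    while errors < max_errors and i < toc_end and j < len(frags):
        if is_top_of_page(frags, i) and is_top_of_page(frags, j):
            errors = 0                  # reset the error count on each new page
        if j in frags[i].links:         # the pair matches
            i += 1
            j += 1
        else:                           # a difference: skip whichever comes first
            errors += 1
            if position(j, doc_page0) > position(i, toc_page0):
                i += 1
            else:
                j += 1
    return errors < max_errors and i >= toc_end


def find_copy_starts(frags, toc_start, toc_end, max_errors=2):
    """Try every fragment linked to the first ToC entry as a potential copy start."""
    return [j for j in sorted(frags[toc_start].links)
            if j >= toc_end and is_copy_at(frags, toc_start, toc_end, j, max_errors)]


def example_document():
    """Toy layout: ToC on page 1, section A on page 2, ToC copy on page 3."""
    frags = [Frag(page=1, y=y) for y in (1, 2, 3, 4)]    # indices 0-3: ToC
    frags += [Frag(page=2, y=y) for y in (1, 2, 3)]      # 4: heading A, 5-6: body
    frags += [Frag(page=3, y=y) for y in (1, 2, 3, 4)]   # 7-10: copy of the ToC
    frags += [Frag(page=4, y=1)]                         # 11: heading B
    frags[0].links = {4, 7}
    frags[1].links = {8, 11}
    frags[2].links = {9}
    frags[3].links = {10}
    return frags
```

With the toy layout, `find_copy_starts(example_document(), 0, 4)` reports the copy starting at index 7 and rejects the genuine heading at index 4, because the body text that follows the heading exhausts the error budget. Each fragment is visited at most once per candidate start, which matches the linear behavior described above.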

The above longest common string-type figure of merit component assumes that the two copies of the table of contents share the same segmentation in columns and pages and the same layout on the page. To relax this constraint, a more general longest common string-type figure of merit component can be applied, albeit at typically higher processing complexity.

If a copy of the table of contents is found, then the table of contents validator 160 checks whether any of the linked text fragments lie within the copy. If so, then these links are presumed to be invalid, and the invalid links remover 164 removes those linked text fragments located within the substantial copy from the group of text fragments identified as linked text fragments, and the table of contents selector 34 and links optimizer 38 update the identification of the selected table of contents 110 including the linked text fragments with those text fragments within the substantial copy removed or excluded as candidates for identification as linked text fragments. The process iterates until a table of contents is selected which does not include links to a copy of itself.

Once validated, the selected table of contents 110 is suitably used for various purposes. For example, a document organizer 170 can use the extracted table of contents, including the substantially contiguous group of text fragments defining the table of content entries and the different group of text fragments defining linked text fragments that are linked with corresponding table of content entries, to organize the document into a structured XML, HTML, or otherwise structured document.

It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.

1. A method for identifying a table of contents in a document, the method comprising: extracting text fragments from the document; identifying (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries; and during the identifying, reducing a number of text fragments that are candidates for identification as linked text fragments based on at least one reduction criterion.
 2. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression, text fragments that one of (i) match, or (ii) do not match, the regular expression being excluded as candidates for identification as linked text fragments.
 3. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression setting forth an indexing text fragment portion, text fragments being selectively excluded as candidates for identification as linked text fragments based on the comparing.
 4. The method as set forth in claim 3, wherein the initial index-identifying text fragment portion is one of (i) a leading numeric index, (ii) a leading alphabetic index, and (iii) a leading roman numeral index.
 5. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression setting forth that the text fragment is capitalized, text fragments being selectively excluded as candidates for identification as linked text fragments based on the comparing.
 6. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression setting forth that the text fragment contain at least one keyword selected from a group of keywords consisting of at least one of: “part”, “section”, “chapter”, and “book”, text fragments being selectively excluded as candidates for identification as linked text fragments based on the comparing.
 7. The method as set forth in claim 6, wherein the regular expression further sets forth that the at least one contained keyword be located in a selected position or range of positions within the text fragment.
 8. The method as set forth in claim 1, wherein the extracting of text fragments includes associating page positions with the text fragments, and the reducing comprises: limiting the candidates for identification as linked text fragments based on the associated page positions.
 9. The method as set forth in claim 8, wherein the associated page positions include at least associated vertical page positions, and the limiting comprises: limiting the candidates for identification as linked text fragments to fragments whose associated vertical page position is within a selected distance from a top of the page.
 10. The method as set forth in claim 8, wherein the associated page positions include at least column indices, and the limiting comprises: limiting the candidates for identification as linked text fragments to fragments whose associated column index corresponds with a leftmost column.
 11. The method as set forth in claim 1, further comprising: structuring the document based on the identified table of content entries and linked text fragments.
 12. A method for identifying a table of contents in a document, the method comprising: extracting text fragments from the document; identifying (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries; and validating the identified table of contents entries and linked text fragments based on at least one validation criterion related to distribution of the linked text fragments.
 13. The method as set forth in claim 12, wherein the at least one validation criterion comprises: validate conditional upon a span of the linked text fragments being greater than a validation fraction threshold of the total span of the extracted text fragments.
 14. The method as set forth in claim 13, wherein the validation fraction threshold is between about 10% and about 20%.
 15. The method as set forth in claim 12, wherein the at least one validation criterion comprises: validate conditional upon there being no group of contiguous linked text fragments numbering greater than a threshold having one-to-one correspondence with a group of contiguous table of content entries.
 16. The method as set forth in claim 12, wherein the at least one validation criterion comprises: validate conditional upon no linked text fragments being within a substantial copy of the substantially contiguous group of text fragments identified as table of content entries.
 17. The method as set forth in claim 16, wherein the validating further comprises: searching for the substantial copy of the substantially contiguous group of text fragments identified as table of content entries by comparing the table of content entries with a candidate copy using a figure of merit including at least (i) a measure of textual similarity between the table of content entries and text fragments of the candidate copy and (ii) a longest common string contained in both the table of content entries and the candidate copy.
 18. The method as set forth in claim 16, wherein the substantial copy is a partial copy of the substantially contiguous group of text fragments identified as table of content entries.
 19. The method as set forth in claim 16, further comprising: conditional upon the validating finding one or more linked text fragments within a substantial copy of the substantially contiguous group of text fragments identified as table of content entries, (i) removing linked text fragments located within the substantial copy from the group of text fragments identified as linked text fragments and (ii) updating the identifying with text fragments within the substantial copy excluded as candidates for identification as linked text fragments.
 20. The method as set forth in claim 12, further comprising: structuring the document based on the identified table of content entries and linked text fragments.
 21. An apparatus for identifying a table of contents in a document, the apparatus comprising: a text fragmenter that extracts text fragments from the document; a table of contents region identifier that identifies a contiguous sub-set of the text fragments as a table of contents region; and a table of content extractor that identifies (i) a substantially contiguous group of text fragments within the table of contents region as table of content entries, and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries.
 22. The apparatus as set forth in claim 21, wherein the table of contents extractor selectively excludes text fragments as candidates for identification as linked text fragments based on comparison with a regular expression.
 23. The apparatus as set forth in claim 21, wherein the table of contents extractor selectively excludes text fragments as candidates for identification as table of content entries based on comparison with a regular expression.
 24. The apparatus as set forth in claim 21, wherein the text fragmenter associates page positions with the extracted text fragments, and the table of contents extractor limits the candidates for identification as linked text fragments based on the associated page positions.
 25. The apparatus as set forth in claim 24, wherein the associated page positions include at least one of (i) vertical page positions and (ii) column indices.
 26. The apparatus as set forth in claim 21, wherein the table of contents region identifier comprises: a user interface configured to receive a user identification of the table of contents region.
 27. The apparatus as set forth in claim 21, further comprising: a document organizer that structures the document based on the identified table of content entries and linked text fragments. 